Tutorial Brief

Pandas is powerful and easy-to-use library for data analysis. Is has two main object to represents data: Series and DataFrame.

Finding Help:

NumPy

Base N-dimensional array package

SciPy

Fundamental library for scientific computing

Matplotlib

Comprehensive 2D Plotting

IPython

Enhanced Interactive Console

SymPy

Symbolic mathematics

Pandas

Data structures & analysis

Import libraries


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Working with Series

Series is an array like object.

pd.Series(self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)


In [2]:
x = pd.Series([1,2,3,4,5])
x


Out[2]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

Notice that generated an index for your item

Basic Operation


In [3]:
x + 100


Out[3]:
0    101
1    102
2    103
3    104
4    105
dtype: int64

In [4]:
(x ** 2) + 100


Out[4]:
0    101
1    104
2    109
3    116
4    125
dtype: int64

In [5]:
x > 2


Out[5]:
0    False
1    False
2     True
3     True
4     True
dtype: bool

any() and all()


In [6]:
larger_than_2 = x > 2
larger_than_2


Out[6]:
0    False
1    False
2     True
3     True
4     True
dtype: bool

In [7]:
larger_than_2.any()


Out[7]:
True

In [8]:
larger_than_2.all()


Out[8]:
False

apply()


In [9]:
def f(x):
    if x % 2 == 0:
        return x * 2
    else:
        return x * 3

x.apply(f)


Out[9]:
0     3
1     4
2     9
3     8
4    15
dtype: int64

Avoid looping over your data

This is a %%timeit results from apply() and a for loop.


In [10]:
%%timeit

ds = pd.Series(range(10000))

for counter in range(len(ds)):
    ds[counter] = f(ds[counter])


1 loops, best of 3: 241 ms per loop

In [11]:
%%timeit

ds = pd.Series(range(10000))

ds = ds.apply(f)


10 loops, best of 3: 40 ms per loop

astype()


In [12]:
x.astype(np.float64)


Out[12]:
0    1
1    2
2    3
3    4
4    5
dtype: float64

copy()


In [13]:
y = x

In [14]:
y[0]


Out[14]:
1

In [15]:
y[0] = 100

In [16]:
y


Out[16]:
0    100
1      2
2      3
3      4
4      5
dtype: int64

In [17]:
x


Out[17]:
0    100
1      2
2      3
3      4
4      5
dtype: int64

Avoid using copy (is you can) to save memory


In [18]:
y = x.copy()

In [19]:
x[0]=1

In [20]:
x


Out[20]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [21]:
y


Out[21]:
0    100
1      2
2      3
3      4
4      5
dtype: int64

In [22]:
x.describe(percentile_width=50)


Out[22]:
count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64

DataFrame

pd.DataFrame(self, data=None, index=None, columns=None, dtype=None, copy=False)


In [23]:
data = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(data, columns=["x"])

In [24]:
df


Out[24]:
x
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9

9 rows × 1 columns

Selecting Data


In [25]:
df["x"]


Out[25]:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: x, dtype: int64

In [26]:
df["x"][0]


Out[26]:
1

Adding extra columns


In [27]:
df["x_plus_2"] = df["x"] + 2
df


Out[27]:
x x_plus_2
0 1 3
1 2 4
2 3 5
3 4 6
4 5 7
5 6 8
6 7 9
7 8 10
8 9 11

9 rows × 2 columns


In [28]:
df["x_square"] = df["x"] ** 2
df["x_factorial"] = df["x"].apply(np.math.factorial)
df


Out[28]:
x x_plus_2 x_square x_factorial
0 1 3 1 1
1 2 4 4 2
2 3 5 9 6
3 4 6 16 24
4 5 7 25 120
5 6 8 36 720
6 7 9 49 5040
7 8 10 64 40320
8 9 11 81 362880

9 rows × 4 columns


In [29]:
df["is_even"] = df["x"] % 2
df


Out[29]:
x x_plus_2 x_square x_factorial is_even
0 1 3 1 1 1
1 2 4 4 2 0
2 3 5 9 6 1
3 4 6 16 24 0
4 5 7 25 120 1
5 6 8 36 720 0
6 7 9 49 5040 1
7 8 10 64 40320 0
8 9 11 81 362880 1

9 rows × 5 columns

map()


In [30]:
df["odd_even"] = df["is_even"].map({1:"odd", 0:"even"})
df


Out[30]:
x x_plus_2 x_square x_factorial is_even odd_even
0 1 3 1 1 1 odd
1 2 4 4 2 0 even
2 3 5 9 6 1 odd
3 4 6 16 24 0 even
4 5 7 25 120 1 odd
5 6 8 36 720 0 even
6 7 9 49 5040 1 odd
7 8 10 64 40320 0 even
8 9 11 81 362880 1 odd

9 rows × 6 columns

drop()


In [31]:
df = df.drop("is_even", 1)
df


Out[31]:
x x_plus_2 x_square x_factorial odd_even
0 1 3 1 1 odd
1 2 4 4 2 even
2 3 5 9 6 odd
3 4 6 16 24 even
4 5 7 25 120 odd
5 6 8 36 720 even
6 7 9 49 5040 odd
7 8 10 64 40320 even
8 9 11 81 362880 odd

9 rows × 5 columns

Multi Column Select


In [32]:
df[["x", "odd_even"]]


Out[32]:
x odd_even
0 1 odd
1 2 even
2 3 odd
3 4 even
4 5 odd
5 6 even
6 7 odd
7 8 even
8 9 odd

9 rows × 2 columns

Controlling display options


In [33]:
pd.options.display.max_columns= 60
pd.options.display.max_rows= 6
pd.options.display.notebook_repr_html = False
df


Out[33]:
   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
4  5         7        25          120      odd
5  6         8        36          720     even
  ..       ...       ...          ...      ...

[9 rows x 5 columns]

Filtering


In [34]:
df[df["odd_even"] == "odd"]


Out[34]:
   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
2  3         5         9            6      odd
4  5         7        25          120      odd
6  7         9        49         5040      odd
8  9        11        81       362880      odd

[5 rows x 5 columns]

In [35]:
df[df.odd_even == "even"]


Out[35]:
   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

[4 rows x 5 columns]

Chaining Filters

| OR


In [36]:
df[(df.odd_even == "even") | (df.x_square < 20)]


Out[36]:
   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

[6 rows x 5 columns]

& AND


In [37]:
df[(df.odd_even == "even") & (df.x_square < 20)]


Out[37]:
   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even

[2 rows x 5 columns]

Furter Chaining


In [38]:
df[(df.odd_even == "even") & (df.x_square < 20)]["x_plus_2"][:1]


Out[38]:
1    4
Name: x_plus_2, dtype: int64

scatter_matrix()


In [39]:
pd.scatter_matrix(df, diagonal="kde", figsize=(10,10));



In [40]:
df.describe()


Out[40]:
              x  x_plus_2   x_square    x_factorial
count  9.000000  9.000000   9.000000       9.000000
mean   5.000000  7.000000  31.666667   45457.000000
std    2.738613  2.738613  28.080242  119758.341137
min    1.000000  3.000000   1.000000       1.000000
25%    3.000000  5.000000   9.000000       6.000000
50%    5.000000  7.000000  25.000000     120.000000
            ...       ...        ...            ...

[8 rows x 4 columns]

Reading Data from CSV/TSV Files


In [41]:
url = "http://www.google.com/finance/historical?q=TADAWUL:TASI&output=csv"
stocks_data = pd.read_csv(url)

In [42]:
stocks_data


Out[42]:
       Date      Open      High       Low     Close     Volume
0  11-Aug-14  10579.12  10603.30  10547.21  10596.55  197234714
1  10-Aug-14  10552.48  10614.11  10551.77  10579.12  199773735
2   7-Aug-14  10478.34  10585.38  10478.34  10552.48  202329194
3   6-Aug-14  10450.52  10494.12  10398.25  10478.34  192868941
4   5-Aug-14  10405.81  10501.38  10405.81  10450.52  287651475
5   4-Aug-14  10302.88  10409.47  10290.95  10405.81  223099538
         ...       ...       ...       ...       ...        ...

[241 rows x 6 columns]

In [43]:
stocks_data["change_amount"] = stocks_data["Close"] - stocks_data["Open"]
stocks_data["change_percentage"] = stocks_data["change_amount"] / stocks_data["Close"]
stocks_data


Out[43]:
       Date      Open      High       Low     Close     Volume  \
0  11-Aug-14  10579.12  10603.30  10547.21  10596.55  197234714   
1  10-Aug-14  10552.48  10614.11  10551.77  10579.12  199773735   
2   7-Aug-14  10478.34  10585.38  10478.34  10552.48  202329194   
3   6-Aug-14  10450.52  10494.12  10398.25  10478.34  192868941   
4   5-Aug-14  10405.81  10501.38  10405.81  10450.52  287651475   
5   4-Aug-14  10302.88  10409.47  10290.95  10405.81  223099538   
         ...       ...       ...       ...       ...        ...   

   change_amount  change_percentage  
0          17.43           0.001645  
1          26.64           0.002518  
2          74.14           0.007026  
3          27.82           0.002655  
4          44.71           0.004278  
5         102.93           0.009892  
             ...                ...  

[241 rows x 8 columns]